Add __array_ufunc__ to Series / Array #23293

Merged: 60 commits merged into pandas-dev:master on Jul 1, 2019

Conversation

@jorisvandenbossche (Member) commented Oct 23, 2018

This PR:

  • adds a basic (but incomplete) __array_ufunc__ implementation for IntegerArray (to be able to check it is correctly used from Series for ExtensionArrays)
  • adds Series.__array_ufunc__ (not yet for Index)

@TomAugspurger revived my branch, will try to work on it a bit further this afternoon

What this already does: a basic implementation of the protocol on IntegerArray for simple ufuncs (__call__), and Series dispatching to the underlying values.

One question is whether we want to force EA authors to implement __array_ufunc__ (e.g. by having a default implementation that returns NotImplemented).
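
For illustration, a minimal sketch of the Series side of the protocol (a toy stand-in, simplified; not the actual diff):

import numpy as np

class MiniSeries:
    """Toy stand-in for pandas.Series, only to illustrate the dispatch."""

    def __init__(self, values, index=None):
        self._values = np.asarray(values)
        self.index = range(len(self._values)) if index is None else index

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unbox any MiniSeries inputs to their underlying values,
        inputs = tuple(x._values if isinstance(x, MiniSeries) else x
                       for x in inputs)
        # re-run the ufunc (letting the values' own protocol kick in),
        result = getattr(ufunc, method)(*inputs, **kwargs)
        # and re-box the result with the original index.
        return MiniSeries(result, index=self.index)

s = MiniSeries([1, 2, 3])
np.add(s, 1)  # routed through MiniSeries.__array_ufunc__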

@pep8speaks commented Oct 23, 2018

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-01 18:48:08 UTC

@TomAugspurger (Contributor) left a comment

One question is whether we want to force EA authors to implement __array_ufunc__ (e.g. by having a default implementation that returns NotImplemented).

Would the motivation here be to simplify pandas-internal code, so we can skip any checks for it?

@jorisvandenbossche (Member, Author)

BTW, if you already had some work on this as well, feel free to push your branch in case I can use some of it.

@jorisvandenbossche (Member, Author)

Regarding the IntegerArray __array_ufunc__ implementation, the question comes up to what extent we want to integrate our _create_arithmetic_method with it, because __array_ufunc__ needs to be able to do everything that _create_arithmetic_method does. So do we want to use the dunder methods created in _create_arithmetic_method inside __array_ufunc__, or do we want to base the ops dunder methods on __array_ufunc__ (like NDArrayOperatorsMixin does: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.lib.mixins.NDArrayOperatorsMixin.html#numpy.lib.mixins.NDArrayOperatorsMixin)?
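
For reference, the second option would look roughly like the mixin docs' example; a minimal sketch (WrappedArray is hypothetical, standing in for an ExtensionArray):

import numbers
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin

class WrappedArray(NDArrayOperatorsMixin):
    """The mixin derives __add__, __lt__, etc. from __array_ufunc__."""

    _HANDLED_TYPES = (np.ndarray, numbers.Number)

    def __init__(self, values):
        self._data = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Defer to operands we don't know how to handle.
        if not all(isinstance(x, self._HANDLED_TYPES + (WrappedArray,))
                   for x in inputs):
            return NotImplemented
        inputs = tuple(x._data if isinstance(x, WrappedArray) else x
                       for x in inputs)
        return type(self)(getattr(ufunc, method)(*inputs, **kwargs))

a = WrappedArray([1, 2, 3])
a + 1          # provided by the mixin, routed through __array_ufunc__
np.add(a, a)   # same code path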

@jorisvandenbossche (Member, Author)

@TomAugspurger the first option (use __add__ inside __array_ufunc__(np.add, ..) instead of the other way around) is basically what you did for SparseArray:

special = {'add', 'sub', 'mul', 'pow', 'mod', 'floordiv', 'truediv',
           'divmod', 'eq', 'ne', 'lt', 'gt', 'le', 'ge', 'remainder'}
if compat.PY2:
    special.add('div')
aliases = {
    'subtract': 'sub',
    'multiply': 'mul',
    'floor_divide': 'floordiv',
    'true_divide': 'truediv',
    'power': 'pow',
    'remainder': 'mod',
    'divide': 'div',
    'equal': 'eq',
    'not_equal': 'ne',
    'less': 'lt',
    'less_equal': 'le',
    'greater': 'gt',
    'greater_equal': 'ge',
}
flipped = {
    'lt': '__gt__',
    'le': '__ge__',
    'gt': '__lt__',
    'ge': '__le__',
    'eq': '__eq__',
    'ne': '__ne__',
}
op_name = ufunc.__name__
op_name = aliases.get(op_name, op_name)
if op_name in special and kwargs.get('out') is None:
    if isinstance(inputs[0], type(self)):
        return getattr(self, '__{}__'.format(op_name))(inputs[1])
    else:
        name = flipped.get(op_name, '__r{}__'.format(op_name))
        return getattr(self, name)(inputs[0])

@codecov (bot) commented Oct 23, 2018

Codecov Report

Merging #23293 into master will decrease coverage by 0.04%.
The diff coverage is 94.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23293      +/-   ##
==========================================
- Coverage   92.04%   91.99%   -0.05%     
==========================================
  Files         180      180              
  Lines       50714    50848     +134     
==========================================
+ Hits        46679    46777      +98     
- Misses       4035     4071      +36
Flag       Coverage Δ
#multiple  90.63% <94.68%> (-0.05%) ⬇️
#single    41.86% <45.74%> (-0.1%) ⬇️

Impacted Files                      Coverage Δ
pandas/core/ops.py                  94.81% <100%> (+0.15%) ⬆️
pandas/core/arrays/categorical.py   95.95% <100%> (+0.03%) ⬆️
pandas/core/arrays/sparse.py        94.15% <100%> (-0.05%) ⬇️
pandas/core/arrays/integer.py       95.44% <87.5%> (-0.87%) ⬇️
pandas/core/series.py               93.72% <96.87%> (+0.05%) ⬆️
pandas/io/gbq.py                    88.88% <0%> (-11.12%) ⬇️
pandas/core/internals/managers.py   96% <0%> (-0.87%) ⬇️
pandas/core/internals/blocks.py     94.38% <0%> (-0.77%) ⬇️
pandas/core/dtypes/concat.py        96.58% <0%> (-0.46%) ⬇️
pandas/core/internals/concat.py     96.48% <0%> (-0.37%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...6e770e8. Read the comment docs.

@@ -623,6 +623,29 @@ def view(self, dtype=None):
        return self._constructor(self._values.view(dtype),
                                 index=self.index).__finalize__(self)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        inputs = tuple(
            x._values if isinstance(x, type(self)) else x
Contributor:

nitpick: could move type(self) outside this expression, so that it's only evaluated once.
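
E.g., a sketch of the suggested tweak:

klass = type(self)
inputs = tuple(x._values if isinstance(x, klass) else x for x in inputs)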

@TomAugspurger (Contributor)

cc @shoyer if you have thoughts here as well.

…] underlying values are object array -> not always doing the correct thing)
            np.array(x) if isinstance(x, type(self)) else x
            for x in inputs
        )
        return np.array(self).__array_ufunc__(
Member:

You probably want to call the public ufunc again here, e.g., return getattr(ufunc, method)(*inputs, **kwargs). Otherwise you don't invoke the dispatching mechanism again.

See NDArrayOperatorsMixin for an example implementation.
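
A sketch of the suggested change (simplified):

def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
    inputs = tuple(np.asarray(x) if isinstance(x, type(self)) else x
                   for x in inputs)
    # Re-invoke the public ufunc so NumPy's dispatch mechanism runs
    # again over the unboxed inputs, instead of bypassing it via
    # ndarray.__array_ufunc__.
    return getattr(ufunc, method)(*inputs, **kwargs)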

        if (method == '__call__'
                and ufunc.signature is None
                and ufunc.nout == 1):
            # only supports IntegerArray for now
Member:

This should check for isinstance(.., IntegerArray)?

        # for binary ops, use our custom dunder methods
        result = ops.maybe_dispatch_ufunc_to_dunder_op(
            self, ufunc, method, *inputs, **kwargs)
        if result is not None:
Member:

Is there a reason why you use a different sentinel value None rather than Python's standard NotImplemented?

Member Author:

Not a specific reason, just that I didn't return anything from maybe_dispatch_ufunc_to_dunder_op, which means the result is None (and our own dunder ops will never return None).
But I can make it more explicit.

            ufunc, method, *inputs, **kwargs)
        if result is NotImplemented:
            raise TypeError("The '{0}' operation is not supported for "
                            "dtype {1}.".format(ufunc.__name__, self.dtype))
Member:

Instead of all this (checking for __array_ufunc__ and then the branch/error message), you could just reinvoke the public ufunc again (as I suggested for IntegerArray). That would also make me more confident that the proper delegation is happening.

Member Author:

Thanks, that sounds like a better idea :)

        if result is NotImplemented:
            raise TypeError("The '{0}' operation is not supported for "
                            "dtype {1}.".format(ufunc.__name__, self.dtype))
        return self._constructor(result, index=self.index,
Member:

Note that not every ufunc returns one result, so you should either handle or explicitly error for the other cases. See NDArrayOperatorsMixin for how to handle this.

Member Author:

Out of curiosity, do you have the same problem with the current implementation of __array_wrap__?

Member Author:

To answer my own question: yes, at least for ufuncs that return a scalar:

In [33]: s = pd.Series([1, 2, 3])

In [34]: np.add.reduce(s)
Out[34]: 
0    6
1    6
2    6
dtype: int64

In [35]: pd.__version__
Out[35]: '0.23.4'

But for ufuncs that return multiple values it seems to work with pandas master (wrapping each return element I suppose).

@@ -623,6 +623,29 @@ def view(self, dtype=None):
        return self._constructor(self._values.view(dtype),
                                 index=self.index).__finalize__(self)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
Member:

Normally we recommend doing a type check and returning NotImplemented if you see an unknown type. If __array_ufunc__ always tries to do the operation, you can get non-commutative operations, e.g., a + b != b + a.

For example, consider another library like xarray that implements its own version of a pandas.Series. Series + DataArray and DataArray + Series should consistently give either Series, DataArray or an error (probably the safest choice) -- they shouldn't just return a result of the same type as the first argument.

This might even be an issue for Series + DataFrame or np.add(Series, DataFrame) -- you probably want to delegate to the DataFrame in that case, not the Series.
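
A toy illustration of that recommendation (hypothetical Inner/Outer classes, standing in for e.g. Series and DataArray):

import numpy as np

class Inner:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Defer to the "higher" container so that Inner + Outer and
        # Outer + Inner consistently produce an Outer.
        if any(isinstance(x, Outer) for x in inputs):
            return NotImplemented
        inputs = tuple(x.data if isinstance(x, Inner) else x for x in inputs)
        return Inner(getattr(ufunc, method)(*inputs, **kwargs))

class Outer:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        inputs = tuple(x.data if isinstance(x, (Inner, Outer)) else x
                       for x in inputs)
        return Outer(getattr(ufunc, method)(*inputs, **kwargs))

np.add(Inner([1]), Outer([2]))  # Inner defers -> Outer handles
np.add(Outer([2]), Inner([1]))  # Outer handles directly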

Member Author:

I was thinking that the __array_ufunc__ of the underlying values we call here already does the check, so I wouldn't need to do it again here.
But indeed, that will probably mess up such cases.

The problem is that for now the __array_ufunc__ uses our own implementations of all arithmetic/comparison operators, so I should somehow list here everything that all those different ops accept, which might be a bit annoying.

Member:

I doubt you need to list types dependent on the operators -- it's probably enough to list the types that can be used in valid operations with pandas.Series.

It's true that you would possibly need to recurse into extension arrays to figure out all the valid types, and it may not be easy to enumerate all scalar types -- e.g., if IPAddressExtensionArray can handle arithmetic with IPAddress, how could pandas.Series know about that?

Member Author (@jorisvandenbossche, Oct 23, 2018):

Yes, I think it is mainly the scalar values that might be hard to list (apart from those, we would probably accept Series, Index, array, and a bunch of array-likes (list, tuple, ...) for the Series case).
But e.g. a Series with datetime-like values can already handle our own Timestamp, Timedelta and offset objects, numpy scalars, datetime standard library scalars, integers, ..., and as I understand it those should all be listed here?
(But anyhow, if we implement an __array_ufunc__ for e.g. PeriodArray, we would need to do that there as well, so we need to make this list in any case.)

But what you mention about custom ExtensionArrays is yet another problem: we don't know what they would accept. By "recurse into extension arrays", would you mean something like checking the ExtensionArray._HANDLED_TYPES (and then expecting EA authors to set this specific property)? Or some other way to inspect?

Member:

By "recurse into extension arrays", would you mean something like checking the ExtensionArray._HANDLED_TYPES (and then expecting EA authors to set this specific property)? Or some other way to inspect?

Yeah, that's about the best I can imagine. But I agree, it doesn't seem easy! There's no general way to detect what types another object's __array_ufunc__ method can handle without actually calling it.

It is probably reasonable to trade-off correctness for generality here, but we should be cognizant that we are making this choice. Alternatively, we might maintain an explicit "blacklist" of types which pandas can't handle, though this gets tricky with non-required imports (e.g., dask and xarray objects).

Member Author (@jorisvandenbossche, Oct 24, 2018):

@shoyer in light of the above, how would you define the handled types for an object-dtype Series, in which you can in principle put any Python object you want? For numpy arrays, things like addition work element-wise for object dtype, so the following works:

In [101]: class A():
     ...:     def __init__(self, x):
     ...:         self.x = x
     ...:     def __add__(self, other):
     ...:         return A(self.x + other.x)
     ...:     def __repr__(self):
     ...:         return "A(x={})".format(self.x)
     ...:          

In [102]: a = np.array([A(1), A(2)], dtype=object)

In [103]: np.add(a, A(3))
Out[103]: array([A(x=4), A(x=5)], dtype=object)

In [104]: np.add(pd.Series(a), A(3))
Out[104]: 
0    A(x=4)
1    A(x=5)
dtype: object

If we do a stricter check for allowed types, the above will no longer work.

Member Author:

How is numpy actually dealing with the above? Does it make an exception for object dtype?

So two ideas that might be useful here:

  • We can make an exception for object dtype, and in that case allow more unknown objects.
  • We could check for objects that implement __array_ufunc__ themselves. If they have the attribute and are not known to us (not in our _HANDLED_TYPES list, e.g. dask or xarray objects), we return NotImplemented; otherwise we pass them through.
    That at least solves the problem for the better-known objects that will probably have the interface, but of course not yet for custom container types that do not implement it.

Member:

These are good questions. In principle NumPy follows the rules described at http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#behavior-in-combination-with-python-s-binary-operations but I think there is something missing for unknown scalars / object arrays...

Member:

See numpy/numpy#12258 for a description of how NumPy does things.

Contributor (@TomAugspurger, Jun 19, 2019):

To make sure I understand this thread: @shoyer kicked it off by saying

Normally we recommend doing a type check and returning NotImplemented if you see an unknown type

We can handle the scalar types of EAs by requiring that they implement a _HANDLED_TYPES, which would include their scalar type. Then if all the (unboxed from Series / Index) inputs are in handled types, we proceed.

object-dtype complicates things. Did we decide what to do there? Skip the check?
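
A sketch of that requirement, reusing the hypothetical IPAddressExtensionArray example from above:

import numbers
import numpy as np

class IPAddress:
    """Hypothetical scalar type."""

class IPAddressExtensionArray:
    # Hypothetical EA: advertise every type (including its own scalar
    # type) that its ufunc implementation can handle, so a wrapping
    # Series can check operands without knowing about IP addresses.
    _HANDLED_TYPES = (np.ndarray, numbers.Number, IPAddress)

def can_proceed(inputs, handled_types):
    # The check on the (unboxed from Series / Index) inputs.
    return all(isinstance(x, handled_types) for x in inputs)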

@jorisvandenbossche (Member, Author)

@shoyer Thanks for the feedback!

@jorisvandenbossche (Member, Author)

Question: do we want ufuncs to work on IntegerArray for older numpy versions (that don't yet support __array_ufunc__)?

@TomAugspurger for sparse, you did this I suppose by defining __array_wrap__ in terms of __array_ufunc__:

def __array_wrap__(self, array, context=None):
    from pandas.core.dtypes.generic import ABCSparseSeries
    ufunc, inputs, _ = context
    inputs = tuple(x.values if isinstance(x, ABCSparseSeries) else x
                   for x in inputs)
    return self.__array_ufunc__(ufunc, '__call__', *inputs)

but that seems a bit like a hack? (And I also don't fully understand it; this will compute the ufunc twice -- once on the densified data, before the result of that is passed to __array_wrap__?)

@TomAugspurger (Contributor)

It is indeed a hack, since it converts the data to dense.

I'm not sure why the ufunc would be computed twice though.

@jorisvandenbossche (Member, Author) commented Oct 24, 2018

Because it is the result of the ufunc that is passed to __array_wrap__?

@TomAugspurger (Contributor)

Ah, I thought the input was passed... Will take a look.

@jorisvandenbossche jorisvandenbossche changed the title [WIP] Add __array_ufunc__ to Series / Array Add __array_ufunc__ to Series / Array Oct 24, 2018
@jorisvandenbossche (Member, Author)

cc @jreback @jbrockmendel

@jorisvandenbossche (Member, Author)

Instead of using numbers.Number, you could explicitly check for a lack of __array_priority__ and __array_ufunc__ attributes to indicate that something is a scalar / can be safely coerced. That is basically what NumPy does.

@shoyer that would also solve what we discussed before about object dtype, where we don't know what the handled types are? (Currently we are skipping the type check for object-dtype data.)

   array can handle.
2. Defer to the :class:`Series` implementation by returning ``NotImplemented``
   if there are any :class:`Series` in the ``types``. This ensures consistent
   metadata handling, and commutativity for binary operations.
Member Author:

This is in general the case for pandas containers, I think? (At least in the future, if we add the ufunc protocol to Index and DataFrame.)
So I would make this more general than only Series: the _HANDLED_TYPES should not include any pandas container types (Series, Index, DataFrame).

@@ -874,6 +875,7 @@ ExtensionArray

- Bug in :func:`factorize` when passing an ``ExtensionArray`` with a custom ``na_sentinel`` (:issue:`25696`).
- :meth:`Series.count` miscounts NA values in ExtensionArrays (:issue:`26835`)
- Added ``Series.__array_ufunc__`` to better handle NumPy ufuncs applied to Series backed by extension arrays (:issue:`23293`).
Member Author:

Should there be a general item for Series.__array_ufunc__ being added? Although, in principle it should not change any behaviour, as we already supported ufuncs?

        def reconstruct(x):
            if np.isscalar(x):
                # reductions.
                if mask.any():
Member Author:

This 'scalar' branch -- is it still relevant if we raise above for 'reduce' methods? (I don't think any of the non-reduce ufuncs can return a scalar.)

@jorisvandenbossche (Member, Author)

I think this is good enough to put in the RC

@shoyer (Member) commented Jun 30, 2019

Instead of using numbers.Number, you could explicitly check for a lack of __array_priority__ and __array_ufunc__ attributes to indicate that something is a scalar / can be safely coerced. That is basically what NumPy does.

@shoyer that would also solve what we discussed before about object dtype, where we don't know what the handled types are? (Currently we are skipping the type check for object-dtype data.)

Yes, I think this would be an elegant way to handle this. Here's what I wrote up last October:
numpy/numpy#12258 (comment)

It's up to you whether you want to check for __array_priority__ and __array_ufunc__ attributes. The former is just a legacy thing now from NumPy's perspective (it's impossible to map operator priorities to integers outside of a single project), but it's probably still worth checking because classes that set __array_priority__ were explicitly opting out of NumPy's automatic arithmetic operations that treat everything as a scalar. This is a good indication that they shouldn't be treated as scalars by pandas, either.
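
A rough sketch of that scalar test (hypothetical helper name):

def can_treat_as_scalar(obj):
    # Objects defining neither __array_ufunc__ nor __array_priority__
    # have not opted into NumPy's dispatch machinery, so it is
    # reasonably safe to coerce them like any other scalar (e.g.
    # element-wise inside an object-dtype Series).
    return (not hasattr(obj, '__array_ufunc__')
            and not hasattr(obj, '__array_priority__'))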

@jorisvandenbossche (Member, Author)

Ah, thanks for the link to that issue. Forgot about that discussion.

@jreback (Contributor) left a comment

some small comments; ok to merge for the rc


As part of your implementation, we require that you

1. Define a ``_HANDLED_TYPES`` attribute, a tuple, containing the types your
Contributor:

are we now recommending setting __array_priority__?

        handled_types = sum([getattr(x, '_HANDLED_TYPES', ()) for x in inputs],
                            self._HANDLED_TYPES + (Series,))
        any_object = any(getattr(x, 'dtype', None) == 'object' for x in inputs)
        # defer when an unknown object and not object dtype.
Contributor:

blank line before comments

@TomAugspurger (Contributor)

I'll update to incorporate #23293 (comment) later today. Fixing the CI for pandas and dask first.

@jreback jreback added this to the 0.25.0 milestone Jul 1, 2019
@jreback (Contributor) commented Jul 1, 2019

also, is there any way we can pin numpy 1.17rc1 in one of the builds, just to be sure (as we saw some other breakages on this), even though we test against master?

@TomAugspurger (Contributor)

I tried to implement the __array_priority__ / __array_ufunc__ check. A bit rushed today, so I may have messed something up.

The main wrinkle is that Series.__array_ufunc__ doesn't want to return NotImplemented for ExtensionArrays defining __array_ufunc__. We want them to defer to us, so that we get consistent behavior for binops between Series and EA (we take care of unboxing and calling the ufunc on the array).

@jorisvandenbossche (Member, Author)

If I am understanding correctly, we should only check for the presence of __array_ufunc__ / __array_priority__ on an object if it is not one of the _HANDLED_TYPES. And for Series, the EA is in the handled types?
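
A sketch of that ordering (hypothetical helper, simplified):

def should_defer(obj, handled_types):
    # Known types never trigger a deferral, even if they define
    # __array_ufunc__ themselves (e.g. an ExtensionArray should defer
    # to the Series wrapping it, not the other way around).
    if isinstance(obj, handled_types):
        return False
    # Unknown objects that opt into NumPy's machinery are deferred to
    # by returning NotImplemented.
    return (hasattr(obj, '__array_ufunc__')
            or hasattr(obj, '__array_priority__'))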

@TomAugspurger (Contributor)

Ah, OK, I missed that in the discussion. Just to make sure: we only use Series._HANDLED_TYPES? We don't do the recursive getattr(input, '_HANDLED_TYPES', ())?

@TomAugspurger (Contributor)

FYI, all green here.

@jreback jreback merged commit 02b552d into pandas-dev:master Jul 1, 2019
@jreback (Contributor) commented Jul 1, 2019

thanks @jorisvandenbossche and @TomAugspurger very nice. would be even nicer to shake this down a bit in master......

@jreback (Contributor) commented Jul 1, 2019

any open issues for this?

@jorisvandenbossche (Member, Author)

any open issues for this?

#22798 is related, but need to check if that can be fully closed.

Ah, OK, I missed that in the discussion. Just to make sure: we only use Series._HANDLED_TYPES? We don't do the recursive getattr(input, '_HANDLED_TYPES', ())?

It might be that we should still check the _HANDLED_TYPES of underlying EAs.
Some EAs will want to handle scalars that have a __array_priority__ set (like our own scalars), although this is maybe a sign that we should not define __array_priority__ on them ...

@jorisvandenbossche jorisvandenbossche deleted the array-ufunc branch July 2, 2019 14:06
@TomAugspurger (Contributor) commented Jul 2, 2019 via email

@jorisvandenbossche (Member, Author)

It seems we use 1000 for Series and EAs, and 100 for Timestamp and Timedelta, so that should be OK then?

@TomAugspurger (Contributor) commented Jul 2, 2019 via email

Labels: API Design, Numeric Operations (Arithmetic, Comparison, and Logical operations)
Projects: None yet
6 participants